Pandas Practice Questions
This notebook contains 20 comprehensive Python pandas practice problems organized in two sections:
Section A - Short Coding Questions (Questions 1-17):
- Questions 1-12: Basic pandas operations (loading, selection, filtering, handling missing values, grouping, merging, datetime conversion)
- Questions 13-17: Short coding questions on duplicates, missing values, column creation, filtering, and statistics
Section B - Applied Coding Questions (Questions 18-20):
- Question 18: GroupBy with multiple aggregations
- Question 19: Advanced filtering and column creation
- Question 20: Handling missing values and outliers
Each question includes:
- Clear problem description
- Hints for solving
- Multiple-choice code options (where applicable)
- Instructor solution with inline examples
- Test cases using small DataFrames
import pandas as pd
import numpy as np
from io import StringIO
name = 'Anay Mittal'
roll_number = '2423357'
1. Load a CSV string into a DataFrame
Return: A pandas DataFrame from the CSV string
Choose the correct line:
- (a) return pd.read_excel(StringIO(csv_string))
- (b) return pd.read_csv(StringIO(csv_string))
- (c) return pd.DataFrame(csv_string.split('\n'))
- (d) return csv_string.to_dataframe()
def load_csv_string(csv_string: str) -> pd.DataFrame:
    return pd.read_csv(StringIO(csv_string))
# csv_data = 'name,age,score\nAlice,25,85\nBob,30,90\nCharlie,22,78'
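As a quick check, the sample CSV string from the comment above can be parsed and inspected. This is a minimal sketch; the variable names `csv_data` and `df` are illustrative only.

```python
import pandas as pd
from io import StringIO

# Sample CSV text: a header row followed by three data rows
csv_data = 'name,age,score\nAlice,25,85\nBob,30,90\nCharlie,22,78'

# StringIO wraps the string in a file-like object that read_csv can consume
df = pd.read_csv(StringIO(csv_data))

print(df.shape)          # (3, 3)
print(list(df.columns))  # ['name', 'age', 'score']
```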
2. Get shape and column names
Return: A tuple of (number of rows, number of columns, list of column names)
Choose the correct code:
- (a) return (df.size, df.ndim, df.columns)
- (b) return (df.shape[0], df.shape[1], list(df.columns))
- (c) return df.info()
- (d) return (len(df), len(df.index), df.to_list())
def get_dataframe_info(df: pd.DataFrame) -> tuple:
    return (df.shape[0], df.shape[1], list(df.columns))
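To see how the candidate expressions differ, here is a small sketch on a hypothetical two-row DataFrame. It compares the shape-based tuple from option (b) with `df.size` and `df.info()`.

```python
import pandas as pd

# Hypothetical data: 2 rows, 3 columns
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30], 'score': [85, 90]})

# shape is a (rows, columns) tuple; columns is an Index, so list() converts it
info = (df.shape[0], df.shape[1], list(df.columns))
print(info)  # (2, 3, ['name', 'age', 'score'])

# size counts every cell rather than rows
print(df.size)  # 6

# info() prints a summary to the console but returns None
result = df.info()
```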
3. Get the first n rows of a DataFrame
Return: DataFrame containing first n rows
Choose the correct code:
- (a) return df.iloc[:n]
- (b) return df.head(n)
- (c) return df.nlargest(n, axis=0)
- (d) return df[:n:1]
def get_first_n_rows(df: pd.DataFrame, n: int) -> pd.DataFrame:
    return df.head(n)
4. Get basic statistics for numeric columns
Return: A pandas DataFrame with descriptive statistics (using .describe())
def describe_numeric(df: pd.DataFrame) -> pd.DataFrame:
    return df.describe()
5. Select a single column as a Series
Return: A pandas Series for the specified column
def select_column(df: pd.DataFrame, col_name: str) -> pd.Series:
    return df[col_name]
6. Filter rows where a column value exceeds a threshold
Return: A DataFrame containing only rows where column > threshold
Hint: Use boolean indexing df[df[col_name] > threshold] and .reset_index(drop=True) to reset row indices.
Choose the correct code:
- (a) return df.filter(column=col_name, value=threshold)
- (b) return df.loc[df[col_name] > threshold]
- (c) return df[df[col_name] > threshold].reset_index(drop=True)
- (d) return df.query(f'{col_name} > {threshold}')
def filter_by_threshold(df: pd.DataFrame, col_name: str, threshold: float) -> pd.DataFrame:
    return df[df[col_name] > threshold].reset_index(drop=True)
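The difference between filtering alone and filtering followed by `.reset_index(drop=True)` shows up in the row labels. A short sketch with hypothetical data and a threshold of 80:

```python
import pandas as pd

# Hypothetical scores; rows 1 and 3 exceed the threshold
df = pd.DataFrame({'name': ['A', 'B', 'C', 'D'], 'score': [70, 95, 60, 88]})

# Boolean indexing keeps the original row labels
kept = df[df['score'] > 80]
print(list(kept.index))  # [1, 3]

# reset_index(drop=True) renumbers from 0 and discards the old labels
renumbered = kept.reset_index(drop=True)
print(list(renumbered.index))  # [0, 1]
```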
7. Count missing (NaN) values in each column
Return: A pandas Series with column names as index and count of NaN as values
Hint: Use .isnull().sum() to count missing values in each column.
def count_missing_values(df: pd.DataFrame) -> pd.Series:
    return df.isnull().sum()
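The hint chains two calls: `.isnull()` produces a boolean DataFrame, and `.sum()` adds up the True values per column. A minimal sketch on a hypothetical DataFrame:

```python
import pandas as pd
import numpy as np

# Hypothetical data: 'a' has one gap, 'b' has two, 'c' has none
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': ['x', None, None], 'c': [1, 2, 3]})

# isnull() marks each missing cell as True; sum() counts the True values per column
missing = df.isnull().sum()
print(missing.to_dict())  # {'a': 1, 'b': 2, 'c': 0}
```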
8. Drop rows containing any NaN values
Return: A DataFrame with all rows containing NaN removed
Hint: Use .dropna() to remove rows with missing values, then .reset_index(drop=True) to renumber rows.
def drop_rows_with_nan(df: pd.DataFrame) -> pd.DataFrame:
    return df.dropna().reset_index(drop=True)
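By default, `.dropna()` and `.reset_index()` each return a new DataFrame rather than modifying the original, so the calls are usually chained and the result kept. A sketch with hypothetical data:

```python
import pandas as pd
import numpy as np

# Hypothetical data: only the first row is complete
df = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': ['a', 'b', None]})

# Chain the calls and keep the returned DataFrame
cleaned = df.dropna().reset_index(drop=True)

print(len(df))       # 3, the original is unchanged
print(len(cleaned))  # 1
```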
9. Fill missing values with the mean of the column
Return: A DataFrame where NaN values in numeric columns are replaced by column mean
Hint: Get numeric columns using .select_dtypes(), then use .fillna() with the column mean.
def fill_missing_with_mean(df: pd.DataFrame) -> pd.DataFrame:
    df_copy = df.copy()
    numeric_cols = df_copy.select_dtypes(include='number').columns
    df_copy[numeric_cols] = df_copy[numeric_cols].fillna(df_copy[numeric_cols].mean())
    return df_copy
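One way to follow the hint is sketched below on hypothetical data. Passing the Series of column means to `.fillna()` fills each numeric column with its own mean, while non-numeric columns are left alone.

```python
import pandas as pd
import numpy as np

# Hypothetical data with one gap in each numeric column
df = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'age': [20.0, np.nan, 40.0],
    'score': [np.nan, 80.0, 90.0],
})

# Pick out the numeric column labels only
numeric_cols = df.select_dtypes(include='number').columns

# fillna with a Series of means aligns on column labels
filled = df.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())

print(filled['age'].tolist())    # [20.0, 30.0, 40.0]
print(filled['score'].tolist())  # [85.0, 80.0, 90.0]
```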
10. Group by a column and calculate the mean of another column
Return: A DataFrame with grouped results (group column and mean)
Hint: Use .groupby(group_col)[agg_col].mean() and .reset_index() to convert to DataFrame.
def group_by_mean(df: pd.DataFrame, group_col: str, agg_col: str) -> pd.DataFrame:
    return df.groupby(group_col)[agg_col].mean().reset_index()
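The two steps in the hint can be seen separately: the grouped mean is a Series indexed by group, and `.reset_index()` turns that index back into an ordinary column. A sketch with hypothetical `team` and `salary` columns:

```python
import pandas as pd

# Hypothetical data: team X has two members, team Y has one
df = pd.DataFrame({'team': ['X', 'X', 'Y'], 'salary': [50000, 60000, 52000]})

# Step 1: a Series whose index holds the group labels
by_team = df.groupby('team')['salary'].mean()

# Step 2: a DataFrame with 'team' and 'salary' as regular columns
as_frame = by_team.reset_index()

print(list(as_frame.columns))       # ['team', 'salary']
print(as_frame['salary'].tolist())  # [55000.0, 52000.0]
```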
11. Merge two DataFrames on a common column
Return: A merged DataFrame (inner join on the specified key)
Choose the correct code:
- (a) return left.join(right, on=on)
- (b) return pd.concat([left, right])
- (c) return pd.merge(left, right, on=on, how='inner')
- (d) return left.combine(right)
def merge_dataframes(left: pd.DataFrame, right: pd.DataFrame, on: str) -> pd.DataFrame:
    return pd.merge(left, right, on=on, how='inner')
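An inner join keeps only the keys present in both inputs, whereas `pd.concat` simply stacks rows. A sketch with two hypothetical DataFrames sharing an `id` column:

```python
import pandas as pd

# Hypothetical inputs: ids 2 and 3 appear on both sides
left = pd.DataFrame({'id': [1, 2, 3], 'name': ['A', 'B', 'C']})
right = pd.DataFrame({'id': [2, 3, 4], 'score': [90, 75, 60]})

# Inner join: only matching ids survive, columns from both sides are combined
merged = pd.merge(left, right, on='id', how='inner')
print(merged['id'].tolist())  # [2, 3]
print(list(merged.columns))   # ['id', 'name', 'score']

# concat stacks all rows without matching on the key
stacked = pd.concat([left, right])
print(len(stacked))  # 6
```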
12. Convert a column to datetime format
Return: A DataFrame where the specified column has been converted to datetime
Choose the correct code:
- (a) df_copy[col_name] = df_copy[col_name].astype(datetime)
- (b) df_copy[col_name] = pd.to_datetime(df_copy[col_name])
- (c) df_copy[col_name].convert_to_datetime()
- (d) df_copy[col_name] = datetime.strptime(df_copy[col_name], '%Y-%m-%d')
def convert_to_datetime(df: pd.DataFrame, col_name: str) -> pd.DataFrame:
    df_copy = df.copy()
    df_copy[col_name] = pd.to_datetime(df_copy[col_name])
    return df_copy
13. Drop Duplicate Rows
You have a DataFrame with duplicate rows. Use drop_duplicates with the column subset ['Name', 'Team'] to remove rows that repeat the same Name and Team combination.
def drop_duplicates_by_cols(df: pd.DataFrame) -> pd.DataFrame:
    return df.drop_duplicates(subset=['Name', 'Team'])
sample_df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'Charlie'], 'Team': ['X', 'Y', 'X', 'Z'], 'Salary': [50000, 55000, 50000, 60000]})
14. Fill Missing Values in a Column
Write a Python command to fill all missing values in the column 'College' with the text 'Unknown'.
def fill_missing_college(df: pd.DataFrame) -> pd.DataFrame:
    return df.fillna({'College': 'Unknown'})
# sample_df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'College': ['IIT', None, 'NIT']})
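Applied to the sample data from the comment above, a column-level `.fillna()` replaces only the missing entry. This sketch assigns the result to a new variable, `filled`, so the sample stays untouched.

```python
import pandas as pd

# Sample data from the question: Bob's college is missing
sample_df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'College': ['IIT', None, 'NIT']})

# fillna on the column returns a new Series with the gap replaced
filled = sample_df['College'].fillna('Unknown')

print(filled.tolist())  # ['IIT', 'Unknown', 'NIT']
```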
15. Create New Column with Percentage Increase
Given a DataFrame df with a 'Salary' column, write code to increase salary by 5% and store it in a new column 'UpdatedSalary'.
Hint: Multiply the Salary column by 1.05 to increase by 5%.
Choose the correct code:
- (a) df['UpdatedSalary'] = df['Salary'] * 5
- (b) df['UpdatedSalary'] = df['Salary'] * 1.05
- (c) df['UpdatedSalary'] = df['Salary'] + 0.05
- (d) df['UpdatedSalary'] = df['Salary'].apply(lambda x: x * 5)
def add_updated_salary(df: pd.DataFrame) -> pd.DataFrame:
    df['UpdatedSalary'] = df['Salary'] * 1.05
    return df
# sample_df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Salary': [50000, 55000, 60000]})
16. Filter Rows with Range Condition
Write Python code to select rows where 'Profit' is between 30 and 55 (inclusive).
Hint: Use boolean indexing with AND operator &.
def filter_profit_range(df: pd.DataFrame) -> pd.DataFrame:
    return df[(df['Profit'] >= 30) & (df['Profit'] <= 55)]
# sample_df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'], 'Profit': [25, 40, 60, 35]})
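On the sample data above, the two-sided condition from the hint can be written with `&`, with each comparison wrapped in parentheses. As an alternative not mentioned in the question, `Series.between` is inclusive on both ends by default and should select the same rows.

```python
import pandas as pd

# Sample data from the question
sample_df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'], 'Profit': [25, 40, 60, 35]})

# Parentheses are required because & binds more tightly than >= and <=
in_range = sample_df[(sample_df['Profit'] >= 30) & (sample_df['Profit'] <= 55)]
print(in_range['Name'].tolist())  # ['B', 'D']

# Equivalent selection using between(), inclusive on both bounds by default
same = sample_df[sample_df['Profit'].between(30, 55)]
```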
17. Get Summary Statistics
Write a Python command to show summary statistics (mean, median, std, min, max, etc.) for the entire DataFrame.
def get_summary_stats(df: pd.DataFrame) -> pd.DataFrame:
    return df.describe()
# sample_df = pd.DataFrame({'Age': [25, 30, 28, 35], 'Salary': [50000, 55000, 52000, 60000]})
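In the output of `.describe()`, the median is not labelled "median"; it appears as the '50%' row. A quick sketch on the sample data above:

```python
import pandas as pd

# Sample data from the question
sample_df = pd.DataFrame({'Age': [25, 30, 28, 35], 'Salary': [50000, 55000, 52000, 60000]})

stats = sample_df.describe()

# The row labels of the summary table
print(list(stats.index))  # ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

# The 50% row matches the median of the column
print(stats.loc['50%', 'Age'])  # 29.0
```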
18. GroupBy with Multiple Aggregations
You have a DataFrame with columns: Name, Team, Salary, Profit
Write Python code to:
- Group the data by Team
- Aggregate average salary and total profit for each team
- Return the result
Hint: Use .groupby() with .agg() for multiple aggregations.
Choose the correct code:
- (a) df.groupby('Team').agg({'Salary': 'mean', 'Profit': 'sum'})
- (b) df.groupby('Team')[['Salary', 'Profit']].agg(['mean', 'sum'])
- (c) df.group('Team').apply(lambda x: {'avg_salary': x['Salary'].mean(), 'total_profit': x['Profit'].sum()})
def groupby_team_agg(df: pd.DataFrame) -> pd.DataFrame:
    return df.groupby('Team').agg({'Salary': 'mean', 'Profit': 'sum'})
# sample_df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'], 'Team': ['X', 'X', 'Y', 'Y'], 'Salary': [50000, 55000, 52000, 53000], 'Profit': [45, 30, 60, 25]})
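Running options (a) and (b) on the sample data above shows how they differ. The dict form applies one function per column, while the list form applies every function to every column and produces four result columns.

```python
import pandas as pd

# Sample data from the question
sample_df = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Team': ['X', 'X', 'Y', 'Y'],
    'Salary': [50000, 55000, 52000, 53000],
    'Profit': [45, 30, 60, 25],
})

# Option (a): mean of Salary and sum of Profit, one column each
result = sample_df.groupby('Team').agg({'Salary': 'mean', 'Profit': 'sum'})
print(result['Salary'].tolist())  # [52500.0, 52500.0]
print(result['Profit'].tolist())  # [75, 85]

# Option (b): both functions on both columns, giving a two-level column index
wide = sample_df.groupby('Team')[['Salary', 'Profit']].agg(['mean', 'sum'])
print(wide.shape)  # (2, 4)
```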
19. Advanced Filtering and Column Creation
Given a DataFrame with columns: Name, Score1, Score2
Write Python code to:
- Select only rows where Score1 > 40 AND Score2 > 50
- Create a new column AverageScore = mean of Score1 and Score2
- Return a DataFrame with only the columns ['Name', 'AverageScore']
Hint: Filter first using boolean indexing, then add the new column, then select specific columns.
def advanced_filter_and_create(df: pd.DataFrame) -> pd.DataFrame:
    kept = df[(df['Score1'] > 40) & (df['Score2'] > 50)]
    return kept.assign(AverageScore=kept[['Score1', 'Score2']].mean(axis=1))[['Name', 'AverageScore']]
# sample_df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'], 'Score1': [40, 55, 70, 30], 'Score2': [50, 65, 75, 35]})
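The three steps from the hint can be walked through on the sample data above. This is a sketch; the intermediate names `kept` and `result` are illustrative. Row A is excluded because both comparisons are strict: 40 is not greater than 40.

```python
import pandas as pd

# Sample data from the question
sample_df = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Score1': [40, 55, 70, 30],
    'Score2': [50, 65, 75, 35],
})

# Step 1: filter; copy() so the new column is added to an independent DataFrame
kept = sample_df[(sample_df['Score1'] > 40) & (sample_df['Score2'] > 50)].copy()

# Step 2: row-wise mean of the two score columns
kept['AverageScore'] = kept[['Score1', 'Score2']].mean(axis=1)

# Step 3: keep only the requested columns
result = kept[['Name', 'AverageScore']]
print(result.to_dict(orient='records'))
# [{'Name': 'B', 'AverageScore': 60.0}, {'Name': 'C', 'AverageScore': 72.5}]
```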
20. Handle Missing Values and Outliers
You have a DataFrame with an 'Age' column containing missing values and outliers (Age > 100).
Write Python code to:
- Replace missing values with the median age
- Remove rows where Age > 100
- Return the cleaned DataFrame
Hint: Use .fillna() with median, then filter with boolean indexing.
def clean_age_data(df: pd.DataFrame) -> pd.DataFrame:
    df_copy = df.copy()
    df_copy['Age'] = df_copy['Age'].fillna(df_copy['Age'].median())
    return df_copy[df_copy['Age'] <= 100]
# sample_df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'], 'Age': [25, np.nan, 105, 30, np.nan]})
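Following the order given in the question (fill first, then filter), the median is computed while the outlier is still present. A sketch on the sample data above, with illustrative names `cleaned` and `median_age`:

```python
import pandas as pd
import numpy as np

# Sample data from the question: two missing ages and one outlier (105)
sample_df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [25, np.nan, 105, 30, np.nan],
})

cleaned = sample_df.copy()

# median() skips NaN, so it is taken over 25, 105 and 30
median_age = cleaned['Age'].median()
print(median_age)  # 30.0

# Fill the gaps, then drop the outlier row
cleaned['Age'] = cleaned['Age'].fillna(median_age)
cleaned = cleaned[cleaned['Age'] <= 100]

print(cleaned['Name'].tolist())  # ['Alice', 'Bob', 'Diana', 'Eve']
print(cleaned['Age'].tolist())   # [25.0, 30.0, 30.0, 30.0]
```

If the outliers were removed before computing the median, the fill value could differ; here the order follows the question as written.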